DOCUMENT RESUME

ED 292 059                                          CS 009 062

AUTHOR         Thompson, Charles L.; And Others
TITLE          Speech Recognition Technology: An Application to
               Beginning Reading Instruction. Technical Report.
INSTITUTION    Educational Technology Center, Cambridge, MA.
SPONS AGENCY   National Inst. of Education (ED), Washington, DC.
PUB DATE       Oct 85
CONTRACT       NIE-400-83-0041
NOTE           45p.
PUB TYPE       Reports - Research/Technical (143)
EDRS PRICE     MF01/PC02 Plus Postage.
DESCRIPTORS    Beginning Reading; *Computer Assisted Instruction;
               Elementary Education; *Reading Instruction; *Reading
               Programs; Science Programs; Speech Synthesizers;
               *Technological Advancement
IDENTIFIERS    *Computer Speech Recognition; Software Testing



ABSTRACT 

Noting that the recent development of reliable, 
high-performance, low-cost speech recognizers — devices that can 
distinguish among spoken words — holds potential for education, such 
as early reading instruction, this technical report describes a study 
which investigated two principal questions: (1) Does an inexpensive, 
microcomputer-based speech recognizer perform reliably enough on 
young children's speech to permit application to reading 
instruction?; and (2) What are the main human factors attending such 
use? The Dragon System Mark II Isolated Word Speech Recognizer was 
used in the study, which included four stages. The first phase took 
place in June 1984 and involved 17 kindergartners; the second phase 
took place in November 1984 and involved 7 kindergartners and 8 first 
graders; the third phase took place in late December and involved 10 
kindergartners; and the fourth phase took place in August 1985 and 
involved 6 students who had completed kindergarten and were about to 
enter first grade. The results of the study indicated that speech 
recognition technology holds potential for such educational 
applications as beginning reading instruction. Findings also suggest 
that human factors, such as microphone handling, responses to 
recognition errors, responses to prompts and remarks, and need for 
adult supervision are crucial ingredients in the effective 
application of speech recognition technology in education. (Seven 
tables of data are included and a short bibliography is attached.) 
(NH) 



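***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made      *
*   from the original document.                                       *
***********************************************************************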



SPEECH RECOGNITION TECHNOLOGY:
AN APPLICATION TO
BEGINNING READING INSTRUCTION

Technical Report

October 1985







Educational Technology Center

Harvard Graduate School of Education
357 Gutman Library   Appian Way   Cambridge MA 02138






SPEECH RECOGNITION TECHNOLOGY:
AN APPLICATION TO BEGINNING READING INSTRUCTION

Education Development Center, Inc.
Newton, Massachusetts

October 1985 



Charles L. Thompson 
Beth Wilson 
Philip P. Zodhiates 
Ilene Kantrov 



Preparation of this research report was supported in part by
the National Institute of Education (contract # NIE 400-83-0041). 
Opinions expressed herein are not necessarily shared by NIE and 
do not represent Institute policy. 



ACKNOWLEDGMENTS 

Dragon Systems, Inc., not only provided the speech recognizer used in
this study, but also served as an important technical resource on a wide
variety of issues. Janet M. Baker, President of Dragon, reviewed an
earlier draft of this report, correcting technical inaccuracies and
suggesting ways of analyzing data. Mark Sidell, principal hardware
engineer at Dragon, designed the prototype printed circuit board that we
used, and made hardware adjustments based on our field testing. Jed
Roberts and David Pinto, research scientists at Dragon, attended portions
of the field testing, carried out software adjustments to the recognizer,
and helped us think about the meaning of some of our data.

John H. McCloskey, senior programmer at Education Development
Center, Inc. (EDC), was responsible for programming the reading
instruction software and revising it based on the field testing. David Nelson,
EDC's Director of Media Services, conducted the videotaping of the test
sessions.

We are also grateful to the staff and students of the Phillips
School in Watertown, Massachusetts, the Lincoln-Eliot School in Newton,
Massachusetts, and the Plowshares Childcare Program in Newton,
Massachusetts, who participated in the field tests. In addition, the Watertown
school system kindly loaned us videotaping equipment, which allowed us
to record the testing, and the Education Collaborative for Greater
Boston, Inc., provided space where the testing was conducted.






TABLE OF CONTENTS

Introduction

Background on Speech Technology

Methods

    Phase 1

    Phase 2

    Phase 3

    Phase 4

Results and Discussion






Speech synthesizers, devices that produce spoken language from digital
code, have become familiar adjuncts to microcomputers and have been built
into some computers intended for educational use. Speech recognizers,
devices that can distinguish among spoken words, are a more recent
phenomenon. Until recently such devices had been either too expensive or too
poor in performance to support widespread educational use. The highest
quality recognizers are usually priced in the tens of thousands of
dollars, while the more affordable devices often have accuracy levels well
below the 97 percent experts suggest is necessary for satisfactory use.

Recently, however, the price-performance picture has changed
dramatically. Dragon Systems, a Newton, Massachusetts-based speech
technology research and development firm, has introduced speech recognition
software supported by a printed circuit board that can be priced well
below $500 and that performs with 99.3% accuracy, according to a standard
test of isolated word recognition. Though Dragon's device is limited to
small active vocabularies of isolated words (sixteen to thirty-two words
at a time), its total vocabulary is constrained only by available memory.
Its development signals the advent of high-performance, low-cost speech
recognition and places this technological potential within the reach of
developers of educational software.

A reliable speech recognition capability available at a low price
appears to hold considerable potential for education, especially for
educational tasks in which speech is essential, such as early reading
instruction. In this study we have used the Dragon Systems Mark II
Isolated Word Speech Recognizer to investigate two principal questions:

--Does an inexpensive, microcomputer-based speech recognizer perform
reliably enough on young children's speech to permit application to
reading instruction?

--What are the main human factors attending such use?

It should be emphasized that our research has focused almost exclusively
on technical and human factors issues, rather than on pedagogical or
psychological questions.

The Mark II is mainly in software form. To operate in a standard
Apple II or IIe microcomputer, the Mark II's only requirements are the
printed circuit board designed by Dragon and an inexpensive commercially
available microphone. No other additions to the standard one-drive
configuration are required.

The system is speaker-dependent, requiring each user to train the
recognizer by giving a few samples of his or her pronunciation of each
word to be recognized. The Mark II analyzes these samples and constructs
templates against which to compare subsequent utterances. In some
applications speaker dependence is considered a disadvantage because it
requires each new user to train the system and uses only that speaker's
own templates for future recognition. Our preliminary research, however,
indicated that speaker dependence might in fact be an asset with the
highly variable pronunciation of young children's speech.

For use with beginning readers, of course, text display does not
suffice as the sole or even primary mode of output. The obvious
alternatives are graphics, music, and speech synthesis. In an area such as
reading instruction, speech output is especially important. Moreover, the
speech output must be of sufficient quality to introduce new words to
students, not merely to produce recognizable utterances of words already
known. In this study we have experimented with two different methods of
producing high quality speech output from digital code. Both rely on
compression of actual recorded speech rather than on text-to-speech
synthesis.

Prototype versions of software under development by EDC were used in
the study to provide the reading tasks and the graphics and
computer-generated music used to prompt and reward reading performance.

BACKGROUND ON SPEECH TECHNOLOGY 

Existing speech recognition systems differ along a number of dimensions:
speaker independence vs. speaker dependence, isolated word vs. continuous
speech recognition capability, and vocabulary capacity.

Most currently available systems are speaker-dependent, that is,
they must be trained to respond to individual speakers. Because of the
acoustic variability in phoneme production among different speakers, these
systems are more reliable than speaker-independent ones, which are
intended to work for all users. Speaker-dependent devices work by having each
user provide a few samples of his or her pronunciation of the specific
words to be recognized. The recognizer samples the voice waveforms
thousands of times per second, digitizing them and computing information
on the frequency and temporal characteristics of the speech. From this
information the device constructs templates of the sample utterances and
stores them in memory. Each speaker must repeat each word several times
so that the templates can adequately capture the variability of that
particular voice. Some systems, including the Mark II, average together
all sample utterances (or tokens) to construct one composite template;
many others store each token as a distinct word. Although insufficient
training data is a major cause of speech recognition errors, training can
be tiresome for users, and therefore most systems currently suggest only a
few tokens of each word. In addition, most systems are not
computationally capable of handling more training data. High-performance
speaker-independent systems typically require a very large data base,
constructing their templates from hundreds or thousands of different
speakers.
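
The idea of averaging several tokens into one composite template can be illustrated with a minimal sketch. It assumes each token has already been reduced to a sequence of numeric feature frames, and that tokens of different lengths are brought to a common length by nearest-frame sampling before averaging. Neither detail is described in this report, and the names `resample` and `composite_template` are hypothetical rather than anything taken from the Mark II.

```python
from typing import List, Sequence

Frame = List[float]  # one feature vector per time slice (assumed layout)


def resample(token: Sequence[Frame], n_frames: int) -> List[Frame]:
    """Pick n_frames frames spread evenly across the token (nearest frame)."""
    last = len(token) - 1
    if n_frames == 1:
        return [list(token[0])]
    return [list(token[round(i * last / (n_frames - 1))])
            for i in range(n_frames)]


def composite_template(tokens: Sequence[Sequence[Frame]],
                       n_frames: int = 4) -> List[Frame]:
    """Average several tokens of one word into a single composite template."""
    aligned = [resample(tok, n_frames) for tok in tokens]
    dims = len(aligned[0][0])
    return [
        [sum(tok[f][d] for tok in aligned) / len(aligned) for d in range(dims)]
        for f in range(n_frames)
    ]


# Two tokens of the same word, one spoken more slowly (three frames, not two).
tokens = [
    [[0.0], [1.0], [2.0]],
    [[2.0], [4.0]],
]
template = composite_template(tokens, n_frames=2)
```

Storing one averaged template per word keeps memory and matching cost low, which is presumably part of the appeal on a small microcomputer; storing every token separately trades memory for a closer record of each speaker's variability.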

Once usable templates have been constructed, the recognizer compares
new utterances to them, using statistical algorithms that describe the
acoustic parameters of words. Some systems are word-based; they compare
all of the information extracted from a new utterance to the stored
templates and find the best statistical match between the utterance and one
of the templates. Other systems attempt to segment the speech waveform
into acoustically distinct regions and to compare selected portions,
rather than the entire waveform. Although this feature-based method
reduces the amount of computation required in template matching, and thus
is cheaper and more efficient, it often sacrifices a degree of accuracy.
Some recognizers use a process called dynamic time warping in which the
waveform of an incoming word is figuratively stretched or compressed to
match a stored template. This technique gives the recognizer added
flexibility in compensating for variations in user pronunciation.
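
Dynamic time warping can be sketched in a few lines. The version below is the textbook formulation applied to one-dimensional feature sequences, offered only as an illustration; the report does not say which matching algorithm the Mark II uses, and the function names and sample values are hypothetical.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Smallest cumulative frame-to-frame distance over all time alignments."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def best_match(utterance: Sequence[float],
               templates: Dict[str, Sequence[float]]) -> str:
    """Return the vocabulary word whose template is closest to the utterance."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))


templates = {"yes": [1.0, 2.0, 3.0], "no": [5.0, 5.0, 5.0]}
# The same contour as "yes", but with the first frame drawn out.
word = best_match([1.0, 1.0, 2.0, 3.0], templates)
```

Because a repeated frame can be absorbed by the warping path at no cost, a word spoken more slowly than its template still matches it exactly, which is the flexibility the paragraph above describes.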

Speech recognition systems also differ on the basis of whether they
recognize single isolated words or continuous speech. Some of the most
sophisticated (and most expensive) systems are capable of recognizing
continuous speech with total vocabularies of up to 500 words. Users
develop an application-specific grammar that predetermines groups of
meaningful word sounds. Spoken words are recognized as valid only if they
conform to this predefined grammar. This reduction in the number of
possible word combinations increases accuracy and conserves computation.
Not surprisingly, the cost of a system with this degree of sophistication
places it well out of reach for any widespread educational application.
And even these recognizers designed to work with continuous speech perform
better with isolated words, so much greater is the acoustic variation of
words spoken in continuous speech than of words spoken individually.

Most speech recognizers have a total vocabulary (say, 100 words) and a
smaller active vocabulary (say, 35 words) which they are actively
"listening for" at a given time. These words are stored in random-access
memory where they can be quickly and easily available for matching with
incoming utterances. The majority of systems recognize only isolated
words, and the template matching process begins as soon as the end of a
word is detected. Continuous speech systems may begin template matching
before the word is completed. In either case, as vocabulary size
increases, so do storage and computing requirements, and thus cost.

A major factor in the usefulness of any system is its accuracy in
recognizing the vocabulary on which it has been trained. The primary
means of assessing reliability is to look at how well a system performs
with respect to two principal kinds of errors: substitution errors, or
the mistaking of one word for another; and rejection errors, the refusal
to recognize a valid utterance. These two types of errors have a partly
reciprocal relationship; if the system is designed to make fewer rejection
errors (rejecting fewer "correctly" pronounced words), it is therefore
likely to make more substitution errors, that is, recognizing words
incorrectly. A third kind of error, insertion (the recognition of background
noise or other speech as a valid word), may also be considered in
determining accuracy levels.
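
The reciprocal relationship between rejections and substitutions can be made concrete with a small sketch. It assumes the recognizer produces a distance score for every active word and accepts the closest word only if its score falls at or below a fixed threshold. That acceptance rule, the scores, and the function names are all hypothetical, not taken from the Mark II.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Trial = Tuple[Optional[str], Dict[str, float]]  # (word spoken or None, scores)


def classify(spoken: Optional[str], scores: Dict[str, float],
             threshold: float) -> str:
    """Label one trial; spoken is None when only background noise occurred."""
    best = min(scores, key=scores.get)        # closest template
    accepted = scores[best] <= threshold      # lower distance = better match
    if spoken is None:
        return "insertion" if accepted else "correct_silence"
    if not accepted:
        return "rejection"
    return "correct" if best == spoken else "substitution"


def error_counts(trials: List[Trial], threshold: float) -> Counter:
    """Tally outcomes over a set of trials at one threshold setting."""
    return Counter(classify(s, sc, threshold) for s, sc in trials)


trials = [
    ("car", {"car": 2.0, "tar": 3.0}),   # clear utterance
    ("car", {"car": 6.0, "tar": 5.0}),   # poorly articulated utterance
    (None,  {"car": 7.0, "tar": 8.0}),   # background noise only
]
strict = error_counts(trials, threshold=4.0)
lenient = error_counts(trials, threshold=7.5)
```

With the strict threshold the poor utterance is rejected and the noise is ignored; with the lenient threshold the rejection disappears but is replaced by a substitution, and the noise is now accepted as a word.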

One of the factors that can compromise accuracy has already been
mentioned: variability, both among different speakers and among the same
speaker's utterances. Aside from the acoustic characteristics of phoneme
production, this variability stems from changes in voice volume and from
nonspeech-related sounds such as tongue clicks, breath sounds, and
inadvertent "um's" and "er's." Some individuals seem to have more variable
speech than others, and conventional wisdom within the field of speech
technology holds that approximately 80 percent of recognition errors occur
among 20 percent of users. An additional difficulty is posed by the noise
environment in which the recognizer is being used. This may include noise
from machinery, movement, and electronic noise, as well as background
voices. In fact, competing background voices are usually more disruptive
than other environmental noise. Some recognizers use a calibration system
so that they can be adjusted for different noise conditions; others are
set at a fixed level.

The quality of the microphone also influences reliability. A system
intended to function in a noisy environment has to screen out a great deal
of background noise. In the process it may also screen out some valid
utterances and thus produce a higher rate of rejections. On the other
hand, screening out less background noise will likely produce a higher
rate of substitutions (for example, recognizing "car" for "tar") and
insertions (for example, recognizing background noise as a word). In some
environments a high performance directional headset microphone is
necessary to obtain an adequate level of accuracy.

Until now, applications of speech recognition technology have been
limited mainly to business and industrial settings. For example, speech
recognizers have been used for office automation, automatic telephone
transactions, and inventory control and quality assurance inspections in
factories. In this study we explore the potential of speech recognition
technology in education.

The Dragon Mark II, by combining some strategies of word-based
template matching with more feature-based recognition, and by using more
sophisticated algorithms based on stochastic processing models of
language, has been able to achieve a 99.3 percent accuracy level [1] with
greatly reduced computational requirements. Its vocabulary, though small
(16-32 words active, 200 total), is more than adequate for use in many
educational tasks, such as beginning reading instruction. Speech
recognizers like the Mark II may enable microcomputers to supplement the work
of the teacher or other skilled reader in confirming and correcting the
efforts of young learners.

Synthesized speech output, though not the focus of investigation in
this study, deserves brief mention here because it provides a necessary
and important complement to the speech recognizer. Speech output,
especially for students who cannot yet read, becomes the primary means of
providing instruction and feedback to the child. In our study we tested
two different methods of producing speech output, both of which relied on
compression of actual recorded speech. Initially, we employed a speech
synthesis technique that sampled recorded speech at a high rate. Though
this method produced natural-sounding speech that was easily understood by
young users, it had the disadvantage of requiring large quantities of disk
storage space and multiple drives. Consequently, we replaced the extra
disk drives with a Texas Instruments speech synthesis chip; this more
economical method used linear predictive coding coefficients which were
computed from recorded live speech. We wanted to determine whether its
somewhat less natural sound would still be adequately understood by
kindergarten and first grade students.

METHODS 

Field testing of our prototype speech recognition system was carried out 
in four phases. 

Phase 1:

The first field testing occurred on two consecutive days in June 1984, in
a Watertown, Massachusetts, public elementary school. The participants, 17
kindergarteners (12 boys and 5 girls), were nominated by their teachers and
were given the option of participating in the study. Standard school
district policies on parental and student consent were followed. Teachers
were asked to choose students from a range of ability levels; according to
teacher classification, the test group comprised 5 low, 5 average, and 7
high ability readers. Seven students were 5 years old, nine were 6 years
old, and one was 7 years old.

Students were taken from their regular classrooms and conducted to 
another room in the school where the system had been set up. To reduce 
any anxiety associated with the experiment, students were taken in pairs. 
The entire testing procedure for each student required from fifteen to 
twenty minutes. 

Upon entering the room students were seated together in front of an
Apple IIe microcomputer equipped with an inexpensive, commercially
available microphone. On the first day children held the microphone in their
hands; on the second day the microphone was placed in a stand to determine
whether a stationary position would increase the accuracy of the speech
recognizer. The testing was conducted by three EDC staff members: one,
referred to here as the experimenter, assisted the students as they took
turns completing the series of activities involved in each test session;
another collected observational data; and a third videotaped the sessions.

First, students were pretested on the eight words used in the
prototype program, plus two words that controlled the menu, yes and no. The
students were shown these ten words on index cards and asked to read them.

Following the pretest the experimenter explained to students that
they would be learning some new words by reading a story and playing a
game with the computer. They were told that the microphone would help the
computer to hear them and were instructed in how to hold the microphone.
Background noise levels were monitored and recorded with the use of a
professional quality sound meter.

The system was then booted up, and children began to receive their
instructions from the synthesized speech output, provided in this phase by
the higher quality speech compression method. The training was begun as
the speech output told participants that the computer needed to know how
they said some special words. Beginning with the first word, the speech
output instructed students to "please say ____." After each utterance
the computer either signaled acceptance by moving the word on the screen,
completing part of the graphic, and playing a musical tone, or instructed
the child to say the word again. If, after four utterances were accepted,
the system was able to construct a usable template, the child was rewarded
with a musical phrase. In the case of a bad template, the child was
instructed, "Sorry, we need to do that one again," and the training was
repeated for that word. Once all the templates were formed, the
recognizer checked them by asking the child for one more sample of each word. In
some instances a second round of retraining was necessary.

The testing continued with the reading of the story. The computer
produced speech output for the text as it appeared on the screen,
accompanied by graphic illustrations. Target words were shown in large
letters, and the speech output instructed the child to "read the big word."
Then, for each target word, the program would pause, giving the child the
opportunity to read the word aloud. The recognizer had to match the
child's utterance against the stored templates to determine whether the
child had responded correctly. If the child supplied the correct word, he
or she was rewarded with a graphic display and a musical phrase. If the
child made a mistake, the computer repeated the sentence stem from the
story and again paused for the child to read the word. If the child did
not produce the correct word on the third trial, the computer provided the
answer.

The final component of the prototype software was a game composed of
a 3x3 matrix containing eight words and an empty center square. By
reading the word from a box that bordered the empty square, children were
able to move that word into the empty box. The objective was to move the
word which began in the lower left-hand corner to the upper right-hand
corner. The game required not only reading, but also a certain amount of
strategy and planning, as only the words next to the empty square could be
moved. At this early stage of the project the instructions for the game
were not available by means of speech output and had to be provided by the
experimenter. The experimenter also provided help when needed in reading
the words and planning strategy as children progressed through the game.
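
The rules of the game can be sketched as a small sliding-puzzle model. The sketch assumes that "next to" means sharing an edge with the empty square (not diagonal), and it uses placeholder labels `w1` to `w8` because the report does not list the actual words; the function names are likewise hypothetical.

```python
from typing import List, Optional, Set, Tuple

Grid = List[List[Optional[str]]]  # None marks the empty square


def find_empty(grid: Grid) -> Tuple[int, int]:
    """Return (row, column) of the empty square."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell is None:
                return r, c
    raise ValueError("grid has no empty square")


def movable_words(grid: Grid) -> Set[str]:
    """Words in boxes that share an edge with the empty square."""
    r, c = find_empty(grid)
    words = set()
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[nr]):
            words.add(grid[nr][nc])
    return words


def move(grid: Grid, word: str) -> Grid:
    """Return a new grid with the recognized word moved into the empty box."""
    if word not in movable_words(grid):
        raise ValueError(f"{word!r} does not border the empty square")
    r, c = find_empty(grid)
    new_grid = [row[:] for row in grid]
    for nr, row in enumerate(grid):
        for nc, cell in enumerate(row):
            if cell == word:
                new_grid[r][c] = word
                new_grid[nr][nc] = None
    return new_grid


def reached_goal(grid: Grid, target: str) -> bool:
    """True once the target word sits in the upper right-hand corner."""
    return grid[0][-1] == target


start: Grid = [["w1", "w2", "w3"],
               ["w4", None, "w5"],
               ["w6", "w7", "w8"]]   # "w6" starts in the lower left-hand corner
after = move(start, "w2")
```

Under these assumptions only four of the eight words can be moved at any turn, so the child has to plan a sequence of readings rather than simply read the target word.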

In the game, as in the story, the recognizer had to match the child's
utterances against stored templates. Whereas in the story the recognizer
listened for the one correct target word for each frame and rejected all
other words, in the game its task was more difficult. It had to listen
for and discriminate among all ten words at once. In both the story and
the game segments, the possible recognition errors were rejection of
correct responses supplied by the child or acceptance of incorrect
responses. In addition, the recognizer could fail to respond at all
because of problems related to microphone position or voice volume, or it
could accept extraneous human or nonhuman environmental noise.

At the conclusion of the session, students were posttested to
determine how many words they had learned. In turn, they were presented with
the words on index cards as in the pretest. They were also asked a few
questions about their experience with the system.

In addition, we collected structured observational data on three
types of recognizer error: rejection (a valid utterance of a trained
vocabulary word is not recognized as such), substitution (one word is
recognized for another), and insertion (background or extraneous speaker
noise is recognized as a valid utterance). For each error we noted when
it occurred in the sequence and what might have caused it. We
hypothesized that errors would be caused by students speaking too softly or too
loudly, by their holding the microphone too far from the mouth, by
extraneous background noise or speech, or by other factors we could not
anticipate but hoped to identify through observation.

Through semi-structured observation, we also collected data on human
factors associated with the educational use of speech technology:
prompting, microphone handling, and response to recognizer error. We
noted whether children were able to comprehend and respond to the
synthesized speech output and whether they were able to use the microphone
appropriately and modulate their voices effectively enough to train the
machine. We also observed their responses to machine errors, including
"no-hears" and misrecognitions, and we noted the level of instructional
support and prompting required for microphone use, for interpreting
feedback from the recognizer, and for persisting with the task, as well as for
help in word reading and generating game-related strategies.




In addition to these observations, we asked students about their
experience with the speech recognition system. Our questions were
designed to determine what they liked most and least, found easiest or
hardest, or would like to see changed. In addition, we wanted to know
whether they had any prior experience using computers and whether they
would like to use the speech recognition system again in the future.




Finally, we videotaped all the testing sessions and interviews in
Phase 1, both for purposes of simple documentation of the experiment and
for possible future use in a tape designed to introduce speech recognition
applications and issues to New England educators.




Phase 2: 




In November 1984, we carried out a second round of testing, again in the
same Watertown public elementary school. Seven kindergarteners and eight
first graders tested the speech recognition system over a period of three
consecutive days. Of the kindergarteners, three were girls and four were
boys; two were 4 years old, and five were 5 years old. Of the first
graders, five were girls and three were boys; seven were 6 years old, and
one was 7 years old. As in Phase 1 they were nominated by their teachers
who rated their reading abilities. Of the kindergarteners, one was rated
as low in ability, two as average, three as high, and one was not rated.
Of the first graders, two were rated as low, two as average, and four as
high in ability.

Phase 2 included certain modifications in the speech recognition
system itself and in the sequence of reading activities. Speech output
was produced with the speech synthesis microchip from Texas Instruments.
The major advantage of this modification was to reduce the required number
of disk drives from four to one. Its one disadvantage was to yield speech
output of slightly lesser quality than the previous method. To facilitate
training, the speech recognition system was adjusted to require a pause of
a fraction of a second before each sample utterance; this was intended to
prevent acceptance of only fragments of words inadvertently spoken over
speech output prompts. In addition, adjustments were made in both
hardware and software to improve the recognizer's performance. These
adjustments included fine-tuning of the software parameters for recognition as
well as changes in the gain setting and the volume threshold for incoming
utterances. Adjustment of the gain affects sound amplification and
therefore the recognizer's responses to background versus foreground noise. A
separate hardware adjustment raised the upper volume level which the
recognizer would accept. The microphone was the same one used in Phase 1,
and it was again placed in a stand in front of the computer.

In this phase a second set of eight words was added, making a total
of eighteen possible words. Students were pretested on all eighteen
words. Each student trained the system on the first set of words and then
read the story or played a game. For nine students this process was
repeated with the second set of words, but with the difference that for
both story and game the two sets of words were randomly intermingled.
Because of time constraints, six students did only the first set. A full
session including both sets of words required between thirty and forty
minutes.

Training in Phase 2 included some different ways of prompting
students to repeat the words. On the first three vocabulary words the
machine used a 5-second delay after each utterance to allow the child to
repeat the word again. If the child failed to do so, the computer
prompted again by asking the child for another utterance. On subsequent
words a 10-second delay was used to allow the child even more time to
anticipate the prompt and thereby move through the training more quickly
and eliminate the boredom of hearing the prompt again and again.

Second, if after two attempts at training a particular word, the
recognizer still did not have a good template, the word was moved to the
end of the list. This was done to eliminate boredom associated with having
to repeat the same word many times in succession. Finally, for the
observer's information the computer provided different output tones for words
that were too loud or too soft, as well as for those that were in the
appropriate volume range but rejected for other reasons.
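
The "move it to the end of the list" rule can be sketched as a simple queue. The sketch assumes a caller-supplied function that runs one training attempt and returns a template or nothing, and it adds a cap on how often a word may be deferred so that the loop always ends. The cap, the parameter names, and `train_vocabulary` itself are hypothetical, not part of the prototype described here.

```python
from collections import Counter, deque
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def train_vocabulary(
    words: Iterable[str],
    try_build_template: Callable[[str], Optional[object]],
    attempts_per_turn: int = 2,   # two attempts before deferring, as in Phase 2
    max_deferrals: int = 3,       # assumed cap so training always terminates
) -> Tuple[Dict[str, object], List[str]]:
    """Train each word; a word that fails its turn goes to the end of the list."""
    queue = deque(words)
    deferrals: Counter = Counter()
    templates: Dict[str, object] = {}
    order: List[str] = []         # sequence in which the child was prompted
    while queue:
        word = queue.popleft()
        for _ in range(attempts_per_turn):
            order.append(word)
            template = try_build_template(word)
            if template is not None:
                templates[word] = template
                break
        else:
            # No usable template this turn: defer instead of repeating at once.
            deferrals[word] += 1
            if deferrals[word] <= max_deferrals:
                queue.append(word)
    return templates, order


# Stand-in for the recognizer: "no" yields a bad template three times running.
attempts: Dict[str, int] = {}


def fake_recognizer(word: str) -> Optional[str]:
    attempts[word] = attempts.get(word, 0) + 1
    if word == "no" and attempts[word] <= 3:
        return None
    return word.upper()


templates, order = train_vocabulary(["yes", "no", "cat"], fake_recognizer)
```

In this run the troublesome word is attempted twice, set aside while the next word is trained, and then revisited, so the child never repeats it more than twice in succession.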

Prompts were also modified in the story. If children initially sup- 
plied an incorrect word they were given a "watch this and then try again" 
prompt followed by a graphic clue. If they still failed to supply the 
correct word, the computer read them the answer. 

A new "concentration" type game was added in this phase. Students
had to match the shapes behind the words in a 3x3 matrix. The shapes were
exposed by reading the words. Eventually, as matches were made, the
picture of a monkey emerged. Some children played this game, some played
the game described in Phase 1, and some played both.



Pre- and posttesting were conducted, and other structured and
semi-structured observational data were collected as in Phase 1. In addition,
recognizer error data were recorded for all portions of the prototype
software.

Phase 3: 

A third round of testing was conducted in late December in a public
elementary school in Newton, Massachusetts. Of the ten kindergarteners
who participated, five were boys and five were girls. Three were rated by
their teachers as low ability readers, two as average, and five as high.
Two were 4 years old, six were 5 years old, and two were 6 years old.

The major innovation in this phase was the use of a headset
microphone. Though this microphone was less expensive and of lesser
quality than the one used in Phases 1 and 2, it was hypothesized that it
would improve the performance of the recognizer by remaining a consistent
distance from the child's mouth at all times. To facilitate training,
synthesized speech was used to prompt students when their responses were
too loud or too soft. The brief pause required before each training
utterance was retained from Phase 2, but shortened. In addition, a number
of further adjustments to the recognition software were aimed at improving
its performance.

The third field test used the same sequence of activities as in Phase 
2 and the same eighteen words. The major addition to the program was a 
game of tic-tac-toe in which the computer played against the child. Other 
modifications included the addition of several prompts and directions in 
each segment of the program. At the beginning of the training, the
computer instructed children explicitly and demonstrated how they should
time their utterances by watching the movement of the word across the
screen. The computer also provided feedback of "louder please," "I can't
hear you," and "That's too loud" for utterances that were outside the
appropriate volume range for the recognizer. In the story, initial wrong
responses were followed by a "try again" prompt, and second wrong
responses were followed by the "watch this and then try again" prompt with
a picture cue. If the child failed to respond at all, the computer waited
10 seconds and then delivered the latter prompt and cue. At the beginning
of each game the computer began by giving instructions and a demonstration
of how to play. During the games, if the child made an incorrect response
or failed to respond, the computer repeated the directions or suggested a
correct response.

Phase 4: 

A fourth and final round of testing took place in early August 1985 with
students enrolled in a private summer program in Newton, Massachusetts.
Of the six students who participated, four were boys and two were girls;
all had completed kindergarten and were about to enter first grade in
September. Five of the children were six years old, and one was five
years old. In this instance we did not receive teacher ratings of
students' reading abilities.

The major innovation in Phase 4 was the use of a high quality,
noise-cancelling headset microphone. All other components of the speech
recognition system, with the exception of an adjustment to the gain
setting, remained the same as in Phase 3. In previous rounds of testing we
had explored a variety of human factors and technical issues--microphone
handling, children's reactions to recognizer errors, children's responses
to prompts, the need for adult supervision, proper settings for the
recognizer--and we now wanted to examine the effects of microphone
quality. We hypothesized that a good quality, noise-cancelling microphone
would improve the performance of the recognizer.

RESULTS AND DISCUSSION 

Our data on errors address our question about the reliability of the
recognizer with young children's speech. In the Phase 1 testing, all but
one student had to retrain at least one of the ten vocabulary words. That
is, the recognizer could not form a satisfactory template from the initial
four utterances, making it necessary for them to give four more. The
range of training repetitions needed was from zero (one child) to ten (one
child), with a mean of 3.9. The training process proceeded more smoothly
in Phase 2 than it had in Phase 1. On the original set of ten vocabulary
words the range of training repetitions needed in Phase 2 was from zero
(four children) to eight (one child), with a mean of 2.2. This
improvement is attributed to modifications in the recognition software,
hardware adjustments, and the stabilization of the microphone. In Phase 3
the range of training repetitions was from zero (one child) to nine (one
child), with a mean of 3.3. The increased prompting and feedback during
Phase 3 training helped students learn how to speak to the computer and to
move through the training more quickly, although the amount of retraining
required was somewhat higher than in Phase 2.

The use of a high quality, noise-cancelling headset microphone in
Phase 4 resulted, with one exception, in an overall improvement in the
performance of the recognizer. In training, for example, the new
microphone led to a significant reduction in the number of training
repetitions.



The one exception was the word "feed," which alone accounted for 
eight of the twenty-one training repetitions in Phase 4. ("Feed" also
presented problems for the recognizer during the story and game portions
of the program; see below.) The recognizer's difficulties with this word
are probably due to the presence of a long "e" in "feed." A vocalized
long "e," such as the one in "feed," produces a low energy sound which is 
inherently difficult for any microphone to capture accurately. The prob- 
lem was particularly severe with the microphone we used; it seems that the 
microphone lacked sufficient sensitivity to handle "feed." A simple, non- 
technological fix would have been to say "feed" a little more loudly. A 
more permanent solution would be to adjust the gain setting so that the 
recognizer becomes more sensitive to low energy sounds. 

Excluding "feed," the amount of retraining required by the recognizer 
in Phase 4 ranged from zero (one child) to four (two children), with a mean 
of 2.2. (The mean for training repetitions per child when "feed" is 
included climbs to 3.5, and the range increases from zero to seven.) This 
information on retraining is contained in Table 1. 
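
The Phase 4 means quoted above follow directly from the distribution of
retrainings across children. As a minimal sketch in Python (the function
name mean_retrainings is ours, and the counts are the Phase 4 figures as
we read them from Table 1), the calculation is:

```python
from typing import Dict


def mean_retrainings(children_by_count: Dict[int, int]) -> float:
    """Mean retrainings per child, given {retrainings needed: children}."""
    total_children = sum(children_by_count.values())
    total_retrainings = sum(n * k for n, k in children_by_count.items())
    return total_retrainings / total_children


# Phase 4 distributions as read from Table 1.
phase4_with_feed = {0: 1, 1: 1, 4: 2, 5: 1, 7: 1}     # 21 retrainings, 6 children
phase4_without_feed = {0: 1, 1: 2, 3: 1, 4: 2}        # 13 retrainings, 6 children

print(round(mean_retrainings(phase4_with_feed), 1))     # 3.5
print(round(mean_retrainings(phase4_without_feed), 1))  # 2.2
```

The difference between the two totals, 21 and 13, is the eight
repetitions attributed to "feed" alone.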



In the story and the game the majority of errors were rejection 
errors, that is, the recognizer refused to accept a valid utterance of a 
vocabulary word. In Phase 1 the story rejections accounted for 24 of the 
28 total errors. In addition, there were 3 insertions and 1 substitution. 
In the game, rejections were still the largest category of error, but by a 
smaller proportion. They accounted for 41 of the 90 total errors, along 
with 37 insertions and 12 substitutions. In Phases 2 and 3 rejections
continued to be the most common type of error, with the combined rates of
substitutions and insertions reduced to 4 percent of Phase 2 errors and 6
percent of Phase 3.

[Insert Table 1 about here]

In Phase 4 we collected separate data for error rates in the story,
game, and menu sections of the program. The story had only one rejection
error: the recognizer refused to accept a valid utterance of the word
"feed." Games, as expected, were more problematic: 28 rejection errors
were recorded, of which 14 involved the word "feed"; in addition, there
were 15 substitution errors, of which 9 were attributable to "feed." The
menu--that portion of the program that allows students to move between the
various sections of the program by responding with a "yes" or "no"--
produced 22 rejection errors which were evenly divided between "yes" and
"no"; in addition, there were two substitution errors (see Table 2).

The data on frequency of errors in the story and games show an 
overall reduction over the course of our testing in the proportion of 
insertions and substitutions and a corresponding increase in the propor- 
tion of rejections. This shift in error frequencies was the expected 
result of the between-phase adjustments to the recognition software and 
gain settings, and provides evidence that these adjustments improved the 
performance of the recognizer. As a general rule, rejection errors do not
present a significant problem for the user, so long as the number of
errors remains relatively small--say no more than three at one time. The
user simply repeats a given word a few times until the recognizer accepts
the utterance as valid. Substitution and insertion errors, however, are
much more serious. When the recognizer substitutes a different word for
the one uttered, or recognizes an extraneous noise as valid, the effect of
even a single error is to mislead or confuse the user.
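
The shift in error proportions can be checked from the raw counts. The
following Python sketch (the function name error_shares is ours) computes
each error type's share of total errors, using the Phase 1 game counts
reported above: 41 rejections, 12 substitutions, and 37 insertions.

```python
from typing import Dict


def error_shares(rejections: int, substitutions: int,
                 insertions: int) -> Dict[str, int]:
    """Each error type as a whole-number percentage of all errors."""
    counts = {
        "rejection": rejections,
        "substitution": substitutions,
        "insertion": insertions,
    }
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no errors recorded")
    return {name: round(100 * n / total) for name, n in counts.items()}


phase1_games = error_shares(rejections=41, substitutions=12, insertions=37)
print(phase1_games)  # {'rejection': 46, 'substitution': 13, 'insertion': 41}
```

On these figures the more serious substitution and insertion errors
together made up over half of the Phase 1 game errors.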



[Insert Table 2 about here]



Though this study was exploratory in nature and not designed to
generate precise measurements of recognizer accuracy or comparisons across
phases, we did select a small number of participants from Phases 2 and 3
and all the participants in Phase 4 for more detailed analysis of data on
recognition accuracy during games--the testing activity that posed the
most difficult speech recognition task. Using videotapes, we chose
children who represented a range in terms of their total utterances and
total errors. Error rates, shown in Table 3, were calculated for each
child by dividing the number of each major type of error--rejections and
misrecognitions--during a game by that child's total utterances during the
game. These error rates should be viewed cautiously within the context of
the field conditions in which they were obtained--conditions which
differed dramatically from the environment in which the Mark II achieved a
99.3 percent accuracy score (see note 1). Given first-time child users,
microphone differences, and a natural noise environment, we might conclude
that the Mark II performed surprisingly well for most children. Even those
children for whom the recognizer performed less well were able to read
more words by the end of the session and, when interviewed, reported that
they found the game the "most fun" and that they would like to use the
system again at another time.
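
The per-child calculation described above can be written out as a short
Python sketch. The GameLog class and error_rates function are hypothetical
names introduced for illustration; the three records are the Phase 2
entries as we read them from Table 3.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GameLog:
    utterances: int        # total utterances by the child during the game
    rejections: int        # valid utterances the recognizer refused
    misrecognitions: int   # substitutions plus insertions


def error_rates(log: GameLog) -> Tuple[float, float]:
    """Return (rejection %, misrecognition %) of the child's utterances."""
    if log.utterances <= 0:
        raise ValueError("child made no utterances")
    return (100 * log.rejections / log.utterances,
            100 * log.misrecognitions / log.utterances)


# Phase 2 children as read from Table 3.
phase2 = [GameLog(34, 9, 0), GameLog(20, 4, 0), GameLog(61, 9, 1)]
rounded = [tuple(round(r) for r in error_rates(log)) for log in phase2]
print(rounded)  # [(26, 0), (20, 0), (15, 2)]
```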



[Insert Table 3 about here]

We noted several possible explanations for recognizer errors. Some
of them may have resulted from environmental background noise. Though the
settings in which testing occurred were generally quiet--even compared,
say, to a regular classroom--there were moment to moment variations as
announcements were made via loudspeaker or as supervised groups of
children passed through the room. These factors were present mainly in the
first two days of Phase 1, when the testing area was in the school
gymnasium. Subsequent testing took place in a school library and computer
room where such interruptions did not occur. Probably more related to
speech recognition errors were extraneous verbalizations by participants
or their partners. This was particularly evident as a cause in Phase 1 for
the much higher rate of insertion errors in the game than in the story.
Because many of the children needed assistance with the rules of the game
and with strategy, as well as with reading the words, there was a great
deal more conversation back and forth between the experimenter and the
children and between the children and their partners. The recognizer at
times accepted this background conversation as a valid utterance. The
addition of prompts and explanations of game rules to the program itself
helped to reduce the need for background conversation in Phases 2, 3, and
4. Unlike many speech recognition systems, the Mark II uses an open
microphone; that is, it does not require the speaker to use a
press-to-talk button or a microphone on/off switch. This makes the system
easier to use, but it also requires the system to work harder to
distinguish user speech from other speech and noise in the environment.

In some cases, the handling of the microphone appeared to be the
cause of problems. On the first day of Phase 1 testing, when a hand-held
microphone was used, many children tended to hold the microphone
too close or too far away from their mouths or to wave it around, making
it difficult for the recognizer to construct usable templates of their
speech or to match subsequent utterances once the templates were stored.
Use of a microphone stand on the second day of the first field test
reduced these problems and contributed to a decrease in the need for
retraining and in certain types of errors during the story and game. In
Phase 2, continued use of the microphone stand and an adjustment of the
amplification improved performance even more, especially during training.

Some errors may have been due to variability among the utterances of
particular children as they tried to accommodate to the system. For
example, if children were initially shy and spoke softly to the computer,
the recognizer may have formed templates from only the portions of their
utterances that were loud enough for it to hear. Later, as students
became more confident and talked more loudly, the recognizer may have
produced errors because their utterances did not match well against the
templates formed earlier. This explanation, for example, may account for
the large number of rejection errors involving the word "yes," the first
word trained by the students. In Phases 3 and 4, the combination of a
headset microphone and gain adjustments kept recognizer performance stable
during the story and game while allowing children to speak less loudly
than in Phases 1 and 2.

Several human factors emerged as important considerations in speech
technology applications with young children. A hand-held microphone may
be more than adequate when carefully and consistently used, but our data
suggest that four- to six-year-old children, especially when less closely
supervised than they were in this experimental setting or when their
attention is devoted more to the task at hand than to the way they are
holding the microphone, are sufficiently erratic in their handling of the
microphone to reduce substantially the recognizer's accuracy.

Use of a microphone stand helped considerably but still required
children to monitor their position relative to it and to learn to speak
into it. One of the advantages of the headset microphones used in Phases
3 and 4 was that they required the least conscious attention from
participants. A problem with the headset microphones we used, however, was
that they were designed for adult heads and adjusted poorly to fit young
children. Thus, they sat precariously on some children and were mildly
uncomfortable for others. In addition, the particular microphone used in
Phase 3 had a tendency to pick up static, which caused the program to
crash. After several such episodes on the first day of Phase 3 testing,
we solved this problem by using an anti-static spray around the computer.
Nevertheless, problems of microphone quality persisted, affecting the
performance of the recognizer. The use of a better quality headset
microphone in Phase 4 resulted in some improvement, particularly in the
story, but recognizer performance continued to be troublesome in the game
portion of the program. Some additional improvement can be expected with
the use of a headset microphone that is designed for children's heads.

Another human factor of interest was whether the variability in the
volume or pronunciation of children's speech would produce problems for
the speech recognizer. We found that most children had an initial
tendency to speak too softly, but that most quickly became accustomed to
the volume level required, especially in Phases 3 and 4 when this level
was closer to the one they used in normal conversation. The use of the
"too loud" and "too soft" prompts in Phases 3 and 4 helped children to
modulate their voices appropriately and to do so with less prompting from
the experimenter. Some children, usually those who were having the most
difficulty getting the recognizer to accept their utterances, sought to
accommodate the machine by altering their pronunciation; they enunciated
more clearly and spoke more loudly or more slowly, all techniques
appropriate for a human listener but likely to aggravate the problem for a
speech recognition device. Fortunately, this occurred with only a few
children. In addition, certain vocabulary words seemed to produce more
errors than others. The word "elephant" produced the most errors in
Phases 1 and 2, and the second most in Phase 3. "Yes," "feed," and
"monkey" also proved difficult for the recognizer. As already noted,
"feed" proved especially troublesome in Phase 4.

As we collected our data on machine errors, we also recorded
children's reactions to these errors and any possible effects on their
performance or their enjoyment of the activity. We found considerable
variability on this point. Some children seemed not to mind having to
retrain several words and even benefited from the extra opportunities to
see the word on the screen. A few became annoyed with having to repeat the
same word so many times. This was ameliorated in Phase 2 when the program
was modified to take a word that required more than one retraining and
move it to the end of the list, rather than train it again for the third
consecutive time. The "too loud" and "too soft" prompts in Phases 3 and 4
also helped by letting the children know that the recognizer had not
"heard" them and why.

Recognition errors in the story and game were also received
differently by different children. Some children were very persistent in
their attempts to make the recognizer understand them and would
confidently repeat answers that were rejected. Other children waited
passively if their first attempt was not accepted. Some children were
quite willing to guess, even on unfamiliar words, while others preferred
to wait indefinitely for prompts or to rely on the experimenter for
assistance.

Another human factor we were concerned about was whether children 
would be able to understand the speech output needed to deliver prompts, 
directions, and rewards. In fact, the speech output proved to work very 
well. Even the lesser quality speech output used in Phases 2, 3, and 4 
proved to be well understood in the connected language output portions 
that instruct students in how to use the machine and how to respond to the 
tasks presented. Even in the training segment, when single words were
initially presented without context, most children had no difficulty 
understanding the words. A few made minor errors such as hearing "seed" 
for "feed" or "carrot" for "parrot", but corrected their errors as the 
feed and parrot graphics appeared on the screen. Only three children 
persisted in their "mishear" once the graphics were displayed, and had to 
be told the correct word by the experimenter. 

In certain places we found that we had to modify or add speech output 
prompts to guide students toward patterns of use that would make the 
recognizer work more reliably or would make the student's experience more 
satisfying and rewarding. The "too loud" and "too soft" prompts already 
mentioned are one example. Another is the set of prompts eventually used 
to introduce and demonstrate the training procedure. Because each word
has to be repeated several times during the training, some means is
necessary to move children along from one repetition to the next with
appropriate pauses between. Initially this was done through "please say"
prompts to cue the utterance and changes in the graphic display to reward
it. We expected that students would attend strongly to the graphics and
begin to anticipate the speech output prompt. When this failed to happen
in Phase 1, we inserted delays after the "please say" prompts in Phase 2
to give students an obvious opportunity to override the prompt and use the
graphic reward as a cue that the recognizer was ready for the next
repetition. Either because the speech prompts were simply much more
salient than the graphics or because the pattern of waiting for the
"please say" prompt had already become too firmly established, most
children did not spontaneously begin to override the speech prompt. Thus,
the delayed prompt, rather than speeding the training process, had the
effect of slowing it down. Finally, in Phase 3 we modified the prompt so
that it explicitly instructed children to attend to the moving word on the
screen and demonstrated how to do so. This explicit approach appears to be
necessary, at least with kindergarten and first-grade students.

We also looked at the level of adult support and supervision required
to keep students moving along in the program activities. We started in
Phase 1 with sparse prompts and directions and found that students tended
to rely on the experimenter to tell them when and how to respond. By
Phase 3 the number of prompts, directions, and demonstrations had
increased in all segments, lessening the need for experimenter
supervision. Even by Phases 3 and 4, however, adult supervision was
clearly required, not only to make sure children understood what they were
expected to do, but also to provide encouragement for them to try again
when the recognizer rejected their correct responses or did not respond to
them.

This need for adult supervision results from an inherent limitation 
of the program: unlike the human prompter, the program cannot distinguish 
between recognizer error and child error. It cannot, depending on the 
case, encourage a child to repeat the correct word again (explaining, as 
the experimenter did in our tests, that "sometimes the machine doesn't 
hear you correctly") or to try a different word. 




Certain aspects of children's interaction with program prompts bear
on program design and illustrate current constraints of speech technology. 
Because the speech synthesis chip produces speech that never varies in 
intonation, recurring lengthy instructions or explanations can prove 
annoying to users. This effect is compounded by the program's inflexibil- 
ity: while a human prompter will adjust or interrupt instructions depen- 
ding on a child's facial or verbal reactions, the program must continue 
until it has finished what it was programmed to say. 

Two ways to deal with this irritation factor are to use speech 
prompts that are as brief as possible and to include more than one version 
of a prompt that is to be repeated. For example, the responses to too 
loud or too soft utterances had two versions which the program used alter- 
nately. "Louder, please," alternated with "I can't hear you," and "That's 
too loud" alternated with "Not so loud, please." 
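
A simple way to alternate prompt versions is to cycle through a list of
variants for each condition. The Python sketch below assumes nothing about
how the original program was written; the PromptRotator class and the
condition labels "too_soft" and "too_loud" are hypothetical, while the
prompt wordings are those quoted above.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class PromptRotator:
    """Hands out prompt variants in turn so no wording repeats back to back."""

    def __init__(self, variants: Dict[str, List[str]]) -> None:
        self._cycles: Dict[str, Iterator[str]] = {
            condition: cycle(texts) for condition, texts in variants.items()
        }

    def next_prompt(self, condition: str) -> str:
        return next(self._cycles[condition])


prompts = PromptRotator({
    "too_soft": ["Louder, please.", "I can't hear you."],
    "too_loud": ["That's too loud.", "Not so loud, please."],
})

first = prompts.next_prompt("too_soft")   # "Louder, please."
second = prompts.next_prompt("too_soft")  # "I can't hear you."
third = prompts.next_prompt("too_soft")   # "Louder, please." again
```

Each condition keeps its own position, so a "too loud" prompt does not
disturb the alternation of the "too soft" prompts.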

As in their varied responses to program errors, children also
differed in their reliance on prompts. Some were reluctant to speak at all
and waited for either the program or the experimenter to urge them to
respond. Several required the nodded approval of the experimenter before
they would respond to the prompts in the program--requiring two levels of
encouragement. Other children, by contrast, were impatient with prompts;
they frequently spoke while the program was prompting and therefore was
not ready to listen to them. The program has only a limited ability to
react to such variations in personality style. It will, for instance,
repeat prompts if a child does not respond at all, but it cannot adjust
the timing of such repetitions for each user, nor can it vary the content
of the prompts to suit individual needs.

The program, then, is quite responsive in a "human"-like way to some
of a child's input, but is both inflexible and undiscriminating in other
aspects of its interaction. Our observations suggest that most of the
children who participated in our study readily adjusted to this
combination of program capabilities and limitations.

Finally, we were interested in the instructional potential of speech
technology for beginning reading. The data we have thus far collected
indicate that such potential does exist. Tables 4, 5, 6, and 7 contain
pre- and posttest data on numbers of words learned by students in each
phase. In Phase 1 three students learned eight words, four students
learned seven words, three students learned six words, two learned five
words, two learned four, two learned three, and one student learned two
words. Teacher-rated ability level, but not age, appeared to be a
predictor of success in learning. Within pairs of students, neither order
of participation nor order of posttesting seemed to affect the number of
words learned. Students were all in their ninth month of kindergarten;
our data suggest that high and average ability readers tended to learn
more new words than low readers.



Again in Phase 2, high and average ability students overall seemed to
learn more words among the kindergarteners. Among the high ability
kindergarteners one student learned 8 words, one learned 5 words, and one
learned 2 words. This group overlapped with the two average students who
learned 2 and 4 words respectively. The low ability kindergartener
learned no words. Among the first graders the high ability students could
already read most of the words. The four low and average ability
first-graders learned 8, 6, 6, and 7 words respectively.



[Insert Table 4 about here] 




[Insert Tables 5 and 6 about here] 


In Phase 3 average and high ability children again were most
successful, learning in the range of one to eight words. Low ability
children in Phase 3 learned no words.

As already noted, Phase 4 students were unrated as to their reading
ability. Nonetheless, both in age and classroom experience--all had
completed kindergarten and had an average age of about six and a half
years--they are comparable to the average and high ability end-of-year
kindergarteners of Phase 1 and the low and average ability
beginning-of-year first graders in Phase 2. Indeed, an examination of
Tables 4, 5, and 7 bears out this comparison; the range of new words
learned in Phase 4 was between four and nine, with a mean of 6.2
(excluding one student who already knew all the words). For Phase 1, the
range is four to eight new words learned, with a mean of 6.5; for Phase 2
the range is six to eight, with a mean of 6.75.

These results suggest that the likelihood of children's benefiting
instructionally from this particular educational application of speech
recognition is related to their current reading and reading readiness
levels. Specifically, our study indicates that the most effective use of
our early reading program may have occurred with children who fit within a
relatively well-specified range: end of kindergarten to beginning of first
grade. Given that our reading ability levels were supplied by teachers,
we conclude that teachers are able to make useful classroom judgments
about which children would benefit from participation.



In conclusion, this exploratory study suggests that speech
recognition technology holds potential for such educational applications
as beginning reading instruction. Perhaps our most significant finding to
date is that human factors are an absolutely crucial ingredient in the
successful application of speech recognition technology in education.
These factors include microphone handling; responses to recognition
errors; responses to prompts and rewards; and need for adult supervision.
Our data also suggest that gain and reject threshold settings play
important roles in recognition accuracy. Finding the proper reject
threshold setting involves something of a tradeoff: a setting that
decreases rejection errors is likely to create an increase in
misrecognition--substitution and insertion--errors. Finally, our research
shows that microphone quality is a third important contributor to the
system's performance. Further testing is indicated to determine more
precisely the kind of microphone that will work most effectively and
consistently for young children at the lowest cost. We hypothesize that
with optimal human factors conditions, proper gain and reject threshold
settings, and an adequate microphone, speech recognition technology can be
used effectively in education.
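
The reject threshold tradeoff can be illustrated with a toy model in
Python. Everything in this sketch is an assumption made for illustration:
we suppose the recognizer scores each sound by its distance to the
best-matching template and accepts the match only when that distance is at
or below the threshold, and the trial values are invented, not data from
the study. As a simplification, a sound whose best match is wrong counts
as an error only when it is accepted.

```python
from typing import List, Tuple

# Each trial: (distance to the best-matching template,
#              True if that template is the word actually spoken;
#              False for background talk or a wrong-word match).
Trial = Tuple[float, bool]


def count_errors(trials: List[Trial], threshold: float) -> Tuple[int, int]:
    """Return (rejection errors, misrecognition errors) at a threshold."""
    rejections = sum(1 for dist, correct in trials
                     if correct and dist > threshold)
    misrecognitions = sum(1 for dist, correct in trials
                          if not correct and dist <= threshold)
    return rejections, misrecognitions


# Invented illustrative trials (not study data).
trials = [(0.2, True), (0.35, True), (0.5, True), (0.7, True),
          (0.45, False), (0.6, False), (0.9, False)]

for threshold in (0.4, 0.65, 1.0):
    print(threshold, count_errors(trials, threshold))
# 0.4 (2, 0)
# 0.65 (1, 2)
# 1.0 (0, 3)
```

Loosening the threshold removes rejection errors one by one but admits
misrecognitions in their place, which is the tradeoff described above.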

Further testing of a stable group of participants over a period of a
few weeks is also needed to determine whether templates would be usable
from session to session, as would be expected on the basis of successful
applications with adult users. Longer term use would also indicate
whether the adult pattern of improved recognizer performance beyond
first-time use would occur with children as well. Finally, repeated use
would yield evidence as to whether greater familiarity would increase
student enjoyment, independence, and learning with the system.







FOOTNOTE 



\1/ This 99.3 percent accuracy level was obtained in a standard
performance test for speech recognition systems. The test is conducted in
a sound-treated room, using a high performance, noise-canceling
microphone. Training of twenty vocabulary words is accomplished using ten
tape-recorded utterances of each word by sixteen speakers, half of whom
are first-time users. Speech recognition performance is then evaluated on
the basis of sixteen tape-recorded test utterances of the same group of
speakers. See Doddington and Schalk (1981) for a detailed description of
the standard evaluation procedures.







BIBLIOGRAPHY 



Baker, J.M. (1981, Fall). How to achieve recognition: A tutorial/
status report on automatic speech recognition. Speech Technology.

Doddington, G.R., & Schalk, T.B. (1981, September). Speech
recognition: Turning theory into practice. Spectrum, pp. 26-32.

Helliwell, J. (1984, June 26). Talking about voice activation. PC
Magazine, pp. 171-188.

Hunt, J.A., & Handa, H.K. (1984, June). Speech recognition struggles to
life. High Technology, pp. 30-32.

Petre, P. (1985, January 7). Speak, master: Typewriters that take
dictation. Fortune, pp. 74-78.

Sandberg-Diment, E. (1984, September 11). Talking back to your computer.
New York Times, Sec. C, p. 3.

Sandberg-Diment, E. (1984, September 18). Computer learns its master's
voice. New York Times, Sec. C, p. 4.

Sandberg-Diment, E. (1984, September 25). Hearing is not so easy either.
New York Times, Sec. C, p. 5.






Table 1. Number of retrainings needed to construct satisfactory
templates of first ten words.

PHASE 1

No. Retrainings     No.
Needed              Children     TOTALS
      0                 1            0
      1                 3            3
      2                 3            6
      3                 1            3
      4                 2            8
      5                 3           15
      6                 1            6
      7                 1            7
      8                 0            0
      9                 1            9
     10                 1           10
                       --           --
                       17           67

Range: 0-10         Average: 3.9


PHASE 2

No. Retrainings     No.
Needed              Children     TOTALS
      0                 4            0
      1                 3            3
      2                 2            4
      3                 3            9
      4                 1            4
      5                 1            5
      6                 0            0
      7                 0            0
      8                 1            8
                       --           --
                       15           33

Range: 0-8          Average: 2.2


PHASE 3

No. Retrainings     No.
Needed              Children     TOTALS
      0                 1            0
      1                 1            1
      2                 4            8
      3                 0            0
      4                 1            4
      5                 1            5
      6                 1            6
      7                 0            0
      8                 0            0
      9                 1            9
                       --           --
                       10           33

Range: 0-9          Average: 3.3


PHASE 4

                    No. Children            TOTALS
No. Retrainings     w/         w/o          w/         w/o
Needed              "feed"     "feed"       "feed"     "feed"
      0                1          1            0          0
      1                1          2            1          2
      2                0          0            0          0
      3                0          1            0          3
      4                2          2            8          8
      5                1          0            5          0
      6                0          0            0          0
      7                1          0            7          0
                      --         --           --         --
                       6          6           21         13

Range w/ "feed": 0-7        Average w/ "feed": 3.5
      w/o "feed": 0-4               w/o "feed": 2.2



Table 2. Frequency of three kinds of speech recognition errors in games
and story.

GAMES

                         Rejection  Substitution  Insertion           Mean
                 No.     Errors     Errors        Errors     Total    Errors/
                 Games   No. (%)    No. (%)       No. (%)    Errors   Game

Phase 1           16     41 (46)    12 (13)       37 (41)      90     5.63
Phase 2           20     97 (96)     3 (3)         1 (1)      101     5.05
Phase 3           13     49 (94)     3 (6)         0 (0)       52     4.00
Phase 4            9     28 (65)    15 (35)        0 (0)       43     4.80
  (w/ "feed")
Phase 4                  14 (61)     9 (39)        0 (0)       23     2.50
  (w/o "feed")

STORY

                         Rejection  Substitution  Insertion           Mean
                 No.     Errors     Errors        Errors     Total    Errors/
                 Stories No. (%)    No. (%)       No. (%)    Errors   Story

Phase 1           16     24 (86)     1 (3)         3 (11)      28     1.70
Phase 2           12     41 (95)     2 (5)         0 (0)       43     3.50
Phase 3           10     46 (100)    0 (0)         0 (0)       46     4.60
Phase 4            6      1 (100)    0 (0)         0 (0)        1     0.17
  (w/ "feed")

MENU

                 No.     Rejection  Substitution  Insertion           Mean
                 Utter-  Errors     Errors        Errors     Total    Errors/
                 ances   No. (%)    No. (%)       No. (%)    Errors   Utterance

Phase 4           93     22 (92)     2 (8)         0 (0)       24     0.26






Table 3. Error rates during games for selected participants in Phases 2, 3, and 4.


          No.          No.         Percent     No.              Percent
Child     Utterances   Rejection   Rejection   Misrecognition   Misrecognition
                       Errors      Errors      Errors           Errors

PHASE 2

  1          34           9           26           0                0
  2          20           4           20           0                0
  3          61           9           15           1                2

PHASE 3

  4          41           5           12           0                0
  5          23           7           30           1                4
  6           8           2           25           0                0

PHASE 4

  7          37          14*          38           1                3
  8          36           8**         22           8***            22
  9          42           2            5           5****           12
 10          43           1            2           1                2
 11          13           3           23           0                0
 12          10           0            0           0                0

*    12 of 14 errors involved the word "feed"
**   2 of 8 errors involved the word "feed"
***  6 of 8 errors involved the word "feed"
**** 2 of 5 errors involved the word "feed"






Table 4. Number of words learned in Phase 1 (out of ten).


PHASE 1
Kindergarteners--9th month

Teacher          Mean       Mean       Mean       Mean
Rating of        Age        Pretest    Posttest   Words
Ability          Yrs-Mos    Score      Score      Learned   (Range)

Low (n=5)        5-7        0.0        3.6        3.6       (2-6)
Average (n=5)    6-0        0.0        6.4        6.4       (5-8)
High (n=7)       5-8        1.1        7.7        6.6       (4-8)






Table 5. Number of words learned in Phase 2 (out of ten).


PHASE 2

Teacher          Mean       Mean       Mean          Mean
Rating of        Age        Pretest    Posttest      Words
Ability          Yrs-Mos    Score      Score         Learned       (Range)

Kindergarteners--3rd month

Low (n=1)        5-2        0.0        0.0           0.0
Average (n=2)    4-2        0.0        [illegible]   [illegible]
High (n=3)       5-3        1.3        5.0           3.7           (2-8)
Unrated (n=1)    5-4        0.0        2.0           2.0

First Graders--3rd month

Low (n=2)        6-6        0.5        7.5           7.0           (6-8)
Average (n=2)    6-8        2.0        8.5           6.5           (6-7)
High (n=4)       6-6        8.5        10.0          1.5           (0-4)






Table 6. Number of words learned in Phase 3 (out of ten).


PHASE 3
Kindergarteners--4th month

Teacher          Mean       Mean       Mean       Mean
Rating of        Age        Pretest    Posttest   Words
Ability          Yrs-Mos    Score      Score      Learned   (Range)

Low (n=3)        5-4        2.7        2.3        0.0
Average (n=2)    5-0        0.5        2.5        2.0       (1-3)
High (n=5)       5-3        3.4        7.4        4.0       (2-8)






Table 7. Number of words learned in Phase 4 (out of ten).


PHASE 4
Summer before First Grade

Teacher          Mean       Mean       Mean       Mean
Rating of        Age        Pretest    Posttest   Words
Ability          Yrs-Mos    Score      Score      Learned   (Range)

Unrated (n=5*)   6-4        1.2        7.4        6.2       (4-9)

* Table excludes data on one student who already knew all ten words at
pretest.






